Dual-channel generative pretraining for learning natural turn-taking in spoken dialogue without labeled data. A 0.5B-parameter model that outperforms models 6x its size on turn prediction.
#voice ai
Content tagged with "voice ai"
A multimodal speech LLM that processes audio directly to enhance conversational AI while reducing overhead compared to traditional ASR-LLM-TTS pipelines.
Two techniques for incorporating prosody into end-to-end SLU: prosody-attention and prosody-distillation. Up to 8% intent classification accuracy improvement on SLURP.
Map-Mix: a data augmentation approach using model training dynamics to guide latent mixup sampling, giving ~2% weighted F1 improvement on low-resource dialect classification.
The first public Indian-accented SLU dataset in the banking domain. SSL speech representations beat ASR-based approaches for intent classification.
How deep learning models can isolate independent factors of variation in data through VAEs and Beta-TCVAE, enabling controlled synthesis and better downstream representations.
A semi-supervised framework for speaker profiling that leverages external unlabelled corpora via supervised, unsupervised, and consistency training, achieving an RMSE of 6.8 years on age estimation.